Metabolic cost as an organizing principle 
for cooperative learning 



David Balduzzi, Pedro A. Ortega, Michel Besserve 

Max Planck Institute for Intelligent Systems, Tübingen, Germany,
{balduzzi, ortega, besserve}@tuebingen.mpg.de



Abstract 

This paper investigates how a population of neuron-like agents can use 
metabolic cost to communicate the importance of their actions. Although 
decision-making by individual agents has been extensively studied, ques- 
tions regarding how agents should behave to cooperate effectively remain 
largely unaddressed. Under assumptions that capture a few basic features 
of cortical neurons, we show that constraining reward maximization by 
metabolic cost aligns the information content of actions with their expected 
reward. Thus, metabolic cost provides a mechanism whereby agents encode 
expected reward into their outputs. Further, aside from reducing energy 
expenditures, imposing a tight metabolic constraint also increases the ac- 
curacy of empirical estimates of rewards, increasing the robustness of dis- 
tributed learning. Finally, we present two implementations of metabolically 
constrained learning that confirm our theoretical finding. These results 
suggest that metabolic cost may be an organizing principle underlying the 
neural code, and may also provide a useful guide to the design and analysis 
of other cooperating populations. 

1 Introduction 

Rational decision making by individual autonomous agents is typically formalized as
maximizing a reward function via reinforcement learning [1, 2]. Although questions remain
concerning the choice of reward function, the choice of optimization method, and how this
approach relates (if at all) to human decision making, the reinforcement learning framework
has been successfully applied to an enormous variety of scientific and engineering problems.

In the closely related setting of a population of cooperating agents we immediately encounter 
a basic question. What can one agent infer from the actions of another? Since some actions 
are invariably more important than others, communicating the relative importance of actions 
is crucial to effective cooperation. This paper shows, under fairly general assumptions, 
that taking metabolic costs into account causes agents to signal which of their actions are 
important and which are not. 

The cortex as a guided self-organizing system. The scenario we have in mind is the 
mammalian cortex, with agents corresponding to neurons. The cortex is a self-organized 
system of millions to billions of neurons (depending on the species) that are guided
by neuromodulators that signal pleasure, pain and other globally salient events. Neurons 
communicate with each other via spiketrains - sequences of silences and spikes they receive 
from and transmit to 10^3 to 10^4 other neurons through connections called synapses. Neurons
learn by increasing and decreasing the efficacy of their synapses according to the timing of 
pre-synaptic (input) and post-synaptic (output) spikes, as well as the presence or absence 
of neuromodulators [3, 4].

The goal of the brain as a whole is to choose favorable actions for its organism in a great
variety of situations. Ultimately, responsibility for choosing the right action falls onto the 
neurons in the central nervous system. Most neurons, however, interact with the envi- 
ronment extraordinarily indirectly - through millions or billions of other neurons - so the 
consequences for the organism of an individual neuron's actions are difficult to pin down. 
As a result, it is unclear when neurons should spike or what their spikes mean. 

Without further assumptions, there is no way - even assuming neurons act optimally according
to a known reward function - for another neuron to know which of its spiketrains are
important and which are not since, for example, neurons cannot access each other's inputs. 
In other words, without further assumptions there is no way for neurons to cooperate. 



Overview of the paper. The key to distinguishing important from unimportant actions 
is that spikes and silences are not abstract, interchangeable symbols. Spiking and responding 
to spikes carries a much higher metabolic cost than not spiking [5]. This cost is significant
since the nervous system consumes a disproportionate share of an organism's total energy 
budget [6]. 

In this paper we attempt to distill an essential feature of cortical computation - metabolic 
cost - and explore its implications for cooperative learning. Formally, we consider agents 
that maximize rewards under metabolic constraints according to a free-energy principle 
following [7-9]. Our first result, §3, is that the effective information generated by an agent's
actions, assuming it acts optimally, quantifies the expected subsequent reward. Thus, it is 
not necessary to know an agent's reward function to compute the importance of its actions; 
all that is needed is an accurate model of its policy. 

In fact, if an agent obeys a strong metabolic constraint, so that the relative frequency of 
its actions is tightly controlled, then both the effective information and expected reward of 
its actions couple tightly to metabolic cost. Thus, if all agents have the same metabolic 
cost profile and impose similarly strong constraints, it is possible to determine the relative 
importance of actions from their metabolic cost. This shows that metabolically constrained 
agents encode expected rewards into their outputs. Intriguingly, this result may generalize
to neurons in cortex, suggesting a novel approach to the neural code [10].



Controlling the frequency of expensive actions (spikes) also improves an agent's ability to 
estimate expected reward, §4. In the asymptotic limit, where an agent samples from its
environment exhaustively, there is no problem. However, over short time scales (which are 
the norm in a rapidly reorganizing environment such as the brain), the expected reward does 
not necessarily coincide with the empirical reward estimate computed from a finite sample. 
It turns out that the higher the effective information generated by spikes, the more closely 
the empirical reward approximates the expected reward. 

Finally, §5 considers some implications of these results for cooperative learning. If expensive
actions encode expected rewards, then agents that wish to cooperate should take advantage 
of this fact by privileging expensive actions in their reward functions. We introduce two min- 
imal models of metabolically constrained learning agents whose reward functions privilege 
spikes. 



Metabolic cost thus provides an organizing principle that tightly binds information, reward 
and approximation errors together. Future work will investigate the interplay between 
metabolism and cooperation in more biologically realistic settings. 

Acknowledgements. We thank Yevgeny Seldin and Giulio Tononi for useful discussions. 



2 A minimal model of neurons in cortex 

Unique features of neurons in cortex. The brain differs from many other populations 
of interacting agents in a number of key respects, two of which we mention here. First, by and 
large neurons do not die, nor are new neurons created; the population is essentially stable. 
Second, by and large neurons do not compete for resources, and "successful" neurons do
not receive a significantly larger share of metabolic resources than "unsuccessful" neurons. 

We also wish to emphasize an important difference between our agents, which abstract 
certain features of cortical neurons, and "agents in the wild". Individual neurons have a 
negligible impact on what happens next in the brain: a single neuron has little effect on the
neurons it targets - each receiving inputs from thousands of other neurons - and even less 
effect on the rest of the brain. The inability of neurons to effectively manipulate their 
environment tremendously simplifies any optimization problems they may choose to solve 
(by stripping out the recursive Bellman aspect). Thus, although the brain is in many ways 
extraordinarily complex, the problems faced by an individual neuron are simpler than those 
faced by autonomous agents, precisely because their ability to project power is extremely 
limited. 



Neurons as agents. The brain is modeled as a population of $K$ homogeneous agents,
where each agent $k$ corresponds to a neuron. Each agent follows its policy $\pi_k$, where $\pi_k(a|s)$
is the probability the agent picks action $a \in A$ upon encountering situation $s \in S$. A
situation is a vector $s = (a_1, \dots, a_K)$ whose entries are the actions of all the agents in the
brain at the previous time step. Since each neuron is only exposed to a small fraction of
the brain, it ignores most entries in the vector, and so effectively the policy of agent $k$ is
$\pi_k(a|s) = \pi_k(a|\Sigma_k)$ for some subset $\Sigma_k \subset \{a_1, \dots, a_K\}$. Let $P(s)$ denote the probability
over the situations $S$, which we assume to be i.i.d.

The population is exposed to a global neuromodulation signal $n \in \mathcal{N}$ that is used to
communicate the performance of the population as a whole. The neuromodulation signal is drawn
with probabilities $P(n|s)$, that is, depending on the current situation. Agents are rewarded
individually for their behavior using a reward function

$$(s, a, n) \mapsto R(s, a, n), \qquad (1)$$

which specifies the reward an agent obtains when it issues action a being in situation s and 
subsequently receives the neuromodulation signal n. Since individual neurons have little 
influence over their environment, we assume that the future situations experienced by the 
agent are unaffected by its output: 

$$p(s(t+1), \dots \mid s(t), a(t)) = p(s(t+1), \dots \mid s(t)), \qquad (2)$$

where t refers to time. Clearly, this assumption fails at the population level - if populations 
of neurons had no effect on the brain state (and therefore its actions) there would be no 
point in them learning. Nevertheless since, for example, destroying a single neuron makes 
essentially no difference to a brain's functioning, we argue that Eq. (2) is a reasonable
assumption at the individual neuron level. 

The goal of any agent^1 is to issue actions that maximize the time-averaged reward
$\frac{1}{T} \sum_{t=1}^{T} R(s_t, a_t, n_t)$. Define the shorthand

$$R(a, s) := E[\, R(a, s, N) \mid p(N|a, s) \,]$$

for the expected reward conditional on a situation-action pair.

Spikes are metabolically expensive. Since spikes are more expensive than silences, we
assume there is a function $c : A \to [0, \infty)$ assigning a cost to each neuronal action, and that
the reference distribution takes the form $P(a) = \lambda \circ c(a)$, where $\lambda$ is a positive, monotonically
decreasing function chosen such that $\sum_a P(a) = 1$.

Agents take metabolic cost into account via a cost function penalizing deviations from
the pre-specified relative action frequencies $P(A)$. If we interpret information content as
resource costs, then

$$C_\pi(a, s) := \pi(a|s) \log \frac{\pi(a|s)}{P(a)} \qquad (3)$$



^1 Under these conditions, agents are contextual bandit players, i.e. every neuron plays against a
slot machine having odds over the rewards that depend only on the current situation [11].

measures the average extra cost the neuron pays for producing action $a$ with frequency
$\pi(a|s)$ instead of frequency $P(a)$ in situation $s$.

We will often consider the case where neurons have two outputs: silence and a single spike.
In this case we fix the metabolic prior

$$P(a) = \begin{cases} 1 - f & \text{if } a = 0, \text{ and} \\ f & \text{if } a = 1, \end{cases} \qquad \text{for some } f < \tfrac{1}{2}.$$

We will also often use the notation $a_0$ for silence and $a_1$ for a spike.

The effective information generated by an action. To understand what agents can
and do say to one another using their actions, it is useful to consider an agent as a
communication channel mapping situations to actions, following [12]. The information transferred
by an agent on average is the mutual information $I_\pi(S, A)$.

However, since we are interested in the information transferred by specific actions, we
introduce effective information^2 [13, 14].



Definition 1 (effective information).

Given an agent with policy $\pi(a|s)$ and prior $P(S)$ on situations, the effective information
generated by action $a \in A$ is

$$ei(\pi, a) := D[\, \pi(S|a) \,\|\, P(S) \,] \quad \text{where} \quad \pi(s|a) := \frac{\pi(a|s)}{\pi(a)} \cdot P(s),$$

and $D[p \| q]$ is the Kullback-Leibler divergence $D[p \| q] := \sum_i p_i \log \frac{p_i}{q_i}$.

Remark 1. The expected effective information generated by an agent, averaged over the action
frequencies $\pi(a)$, is mutual information: $E[\, ei(\pi, a) \mid \pi(a) \,] = I_\pi(S, A)$.

Effective information quantifies the information gained about $S$ by updating prior $P(S)$ to
posterior $\pi(S|a)$ via Bayes' rule. Given policy $\pi$, effective information associates a
non-negative scalar to each action available to the agent:

$$ei(\pi, \cdot) : A \to \mathbb{R} : a \mapsto ei(\pi, a). \qquad (4)$$
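
As a concrete illustration of Definition 1 and Remark 1, the following minimal sketch computes effective information for a small tabular policy and compares the frequency-weighted average with mutual information. The four-situation policy and all function names are hypothetical and chosen only for illustration; logarithms are natural, so values are in nats.

```python
import math

def action_marginals(policy, prior_s):
    """pi(a) = sum_s pi(a|s) P(s), where policy[s][a] = pi(a|s)."""
    n_actions = len(policy[0])
    return [sum(policy[s][a] * prior_s[s] for s in range(len(prior_s)))
            for a in range(n_actions)]

def effective_information(policy, prior_s, a):
    """ei(pi, a) = D[pi(S|a) || P(S)], in nats."""
    pi_a = action_marginals(policy, prior_s)[a]
    ei = 0.0
    for s, p_s in enumerate(prior_s):
        posterior = policy[s][a] * p_s / pi_a      # Bayes' rule: pi(s|a)
        if posterior > 0.0:                        # 0 * log 0 is taken as 0
            ei += posterior * math.log(posterior / p_s)
    return ei

def mutual_information(policy, prior_s):
    """I(S; A) computed directly from the joint P(s) * pi(a|s)."""
    marg = action_marginals(policy, prior_s)
    mi = 0.0
    for s, p_s in enumerate(prior_s):
        for a, pi_as in enumerate(policy[s]):
            if pi_as > 0.0:
                mi += p_s * pi_as * math.log(pi_as / marg[a])
    return mi

# Hypothetical example: four equiprobable situations, actions (silence, spike).
prior_s = [0.25, 0.25, 0.25, 0.25]
policy = [[0.9, 0.1], [0.8, 0.2], [0.95, 0.05], [0.2, 0.8]]
marg = action_marginals(policy, prior_s)
ei_silence = effective_information(policy, prior_s, 0)
ei_spike = effective_information(policy, prior_s, 1)
weighted = marg[0] * ei_silence + marg[1] * ei_spike
```

In this example the rarer action (the spike) carries more effective information than silence, and `weighted` agrees with `mutual_information(policy, prior_s)` up to floating-point error.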



An alternate approach to quantifying information transfer is local transfer entropy [15],
which (roughly) computes the information transferred by a situation-action pair. Since in
our setup an agent only communicates actions to other agents - and not the situations
causing the actions - we average over situations to obtain effective information.



3 Constrained reward maximization 



Inspired by ideas from thermodynamics, a number of authors have recently investigated
the effects of information-theoretic constraints on optimization problems [7-9]. This section
adapts these ideas to investigate how metabolically constraining reward-maximization
impacts the information content of actions.

The goal of each agent is to maximize expected reward, as quantified by $R : S \times A \times \mathcal{N} \to \mathbb{R}$.
Since agents take metabolic cost into account, we introduce the following optimization 
problem: 

Definition 2 (constrained reward maximization).

The optimal policy solves the constrained optimization problem

$$\arg\min_\pi F_\pi(\beta) = E\big[\, C_\pi(S, A) - \beta \cdot R(S, A) \;\big|\; P(s) \cdot \pi(a|s) \,\big]. \qquad (5)$$

Parameter $\beta$ controls the relative weight given to the reward and metabolic cost.

^2 We extend the definition of effective information in [13] to allow arbitrary priors instead of
restricting to the uniform distribution.

Theorem 1 (solutions to free-energy optimization).
The optimal policy $\pi^*$, minimizing (5), satisfies

$$\pi^*(a|s) = \frac{1}{Z(s, \beta)} \cdot P(a) \cdot e^{\beta \cdot R(s, a)},$$

where $Z(s, \beta) = \sum_{a \in A} P(a) \cdot e^{\beta \cdot R(s, a)}$ is the normalization constant.

Proof. See appendix. □ 
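
The closed form in Theorem 1 is easy to evaluate for a tabular problem. The sketch below is illustrative only: it assumes the free energy expands to $\sum_s P(s) \sum_a \pi(a|s) [\log(\pi(a|s)/P(a)) - \beta R(s,a)]$, the KL-regularized form for which the policy of Theorem 1 is the minimizer, and the rewards, priors and function names are hypothetical.

```python
import math

def optimal_policy(prior_a, reward, beta):
    """pi*(a|s) = P(a) * exp(beta * R(s, a)) / Z(s, beta), one row per situation."""
    policy = []
    for r_s in reward:
        weights = [p * math.exp(beta * r) for p, r in zip(prior_a, r_s)]
        z = sum(weights)                         # normalization constant Z(s, beta)
        policy.append([w / z for w in weights])
    return policy

def free_energy(policy, prior_s, prior_a, reward, beta):
    """sum_s P(s) sum_a pi(a|s) * [log(pi(a|s) / P(a)) - beta * R(s, a)]."""
    f = 0.0
    for s, p_s in enumerate(prior_s):
        for a, p_a in enumerate(prior_a):
            pi = policy[s][a]
            if pi > 0.0:                         # 0 * log 0 is taken as 0
                f += p_s * pi * (math.log(pi / p_a) - beta * reward[s][a])
    return f

# Hypothetical example: three situations, actions (silence, spike), spikes rare a priori.
prior_s = [1 / 3, 1 / 3, 1 / 3]
prior_a = [0.9, 0.1]
reward = [[0.0, 1.0], [0.0, -1.0], [0.0, 3.0]]
beta = 1.0

pi_star = optimal_policy(prior_a, reward, beta)
f_star = free_energy(pi_star, prior_s, prior_a, reward, beta)
f_prior = free_energy([prior_a] * 3, prior_s, prior_a, reward, beta)
f_greedy = free_energy([[0.0, 1.0], [1.0, 0.0], [0.0, 1.0]], prior_s, prior_a, reward, beta)
```

Under these assumptions, `f_star` is lower than the free energy of both the prior policy (which ignores rewards) and the greedy deterministic policy (which ignores metabolic cost).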

Theorem 1 is important because it explicitly describes how the optimal policy encodes
rewards into decisions. However, it is not reward but rather relative reward that determines
an agent's actions:

Definition 3 (relative reward).

For a given $\beta$, let the relative reward of action $a$ in situation $s$ be

$$\tilde{R}(a, s) := R(a, s) - \frac{1}{\beta} \log Z(s, \beta).$$

The definition is motivated by the following lemma: 

Lemma 2.

• Minimizing free energy with respect to rewards $R(s, a)$ and relative rewards $\tilde{R}(s, a)$
yields the same optimal policy:

$$\arg\min_\pi \; C_\pi(S, A) - \beta \cdot R(S, A) = \arg\min_\pi \; C_\pi(S, A) - \beta \cdot \tilde{R}(S, A).$$

• The optimal policy has functional form

$$\pi^*(a|s) = P(a) \cdot e^{\beta \cdot \tilde{R}(s, a)}.$$

Proof. If an agent finds itself in situation $s$, it is the difference $R(s, a) - R(s, a')$ that
determines which action it chooses. Adding a situation-dependent constant, $R(s, a) \mapsto R(s, a) + \Lambda(s)$ and
$R(s, a') \mapsto R(s, a') + \Lambda(s)$, makes no difference to the actions of an optimal agent since,
by the assumption in Eq. (2), individual agents cannot affect the probability of situations
arising in the future. □

The lemma says (optimal) agents only care about the relative reward of actions, rather than 
their absolute reward. 

Definition 4 (the importance of an action).

The importance of action $a$ is the expected relative reward:

$$E[\, \tilde{R}_\pi(S, a) \mid \pi(S|a) \,] = \sum_{s \in S} \pi(s|a) \cdot \tilde{R}(s, a).$$

We introduce Definition 4 to avoid repeatedly using the cumbersome phrase "expected
relative reward".

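Theorem 3 (ei quantifies importance up to a metabolic discrepancy).
For an optimal policy $\pi^*$, effective information quantifies the importance of an action up
to a metabolic discrepancy term:

$$\underbrace{ei(\pi^*, a)}_{\text{effective information}} = \beta \cdot \underbrace{E[\, \tilde{R}_{\pi^*}(S, a) \mid \pi^*(S|a) \,]}_{\text{expected relative reward}} + \underbrace{\log \frac{P(a)}{\pi^*(a)}}_{\text{metabolic discrepancy}}.$$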

Proof. Writing out the definition yields

$$ei(\pi^*, a) = \sum_{s \in S} \pi^*(s|a) \log \frac{\pi^*(s|a)}{P(s)} = \sum_{s \in S} \pi^*(s|a) \log \frac{\pi^*(a|s)}{\pi^*(a)}.$$

By Theorem 1 the optimal policy satisfies $\pi^*(a|s) = P(a) \cdot e^{\beta \cdot \tilde{R}(s, a)}$, so

$$ei(\pi^*, a) = \sum_{s \in S} \pi^*(s|a) \cdot \beta \cdot \tilde{R}(s, a) - \log \frac{\pi^*(a)}{P(a)} = \beta \cdot E[\, \tilde{R}_{\pi^*}(S, a) \mid \pi^*(S|a) \,] + \log \frac{P(a)}{\pi^*(a)}.$$

□
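
The decomposition in Theorem 3 can be checked numerically. The sketch below is a hedged illustration rather than part of the original analysis: it assumes the relative reward is $R(s,a) - \frac{1}{\beta} \log Z(s,\beta)$, so that the importance term enters scaled by $\beta$, and it uses hypothetical priors and rewards.

```python
import math

def theorem3_terms(prior_s, prior_a, reward, beta):
    """Return one (ei, beta * importance, discrepancy) triple per action."""
    # Optimal policy pi*(a|s) = P(a) exp(beta R(s,a)) / Z(s, beta), and log Z(s, beta).
    policy, log_z = [], []
    for r_s in reward:
        weights = [p * math.exp(beta * r) for p, r in zip(prior_a, r_s)]
        z = sum(weights)
        policy.append([w / z for w in weights])
        log_z.append(math.log(z))

    n_s = len(prior_s)
    terms = []
    for a, p_a in enumerate(prior_a):
        pi_a = sum(policy[s][a] * prior_s[s] for s in range(n_s))   # marginal pi*(a)
        ei = 0.0
        importance = 0.0
        for s in range(n_s):
            posterior = policy[s][a] * prior_s[s] / pi_a            # pi*(s|a)
            ei += posterior * math.log(policy[s][a] / pi_a)         # KL term
            relative_reward = reward[s][a] - log_z[s] / beta        # assumed definition
            importance += posterior * relative_reward
        discrepancy = math.log(p_a / pi_a)                          # log P(a) / pi*(a)
        terms.append((ei, beta * importance, discrepancy))
    return terms

# Hypothetical example with a non-uniform prior over situations.
terms = theorem3_terms(prior_s=[0.5, 0.3, 0.2], prior_a=[0.9, 0.1],
                       reward=[[0.0, 2.0], [0.0, -1.0], [0.0, 0.5]], beta=1.5)
```

For each action, the first entry of the triple equals the sum of the other two up to rounding, and the discrepancy vanishes exactly when the action frequency under the optimal policy matches the metabolic prior.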

Theorem 3 is our first main result. It says that, if the frequency of actions under policy
$\pi^*$ coincides with the metabolic prior $P(A)$, then the importance of an (optimal) agent's
actions coincides, up to the scale factor $\beta$, with the effective information it generates.

A strong metabolic constraint. Agents "in the wild", such as organisms interacting 
with potentially hostile environments, are forced to take or not take specific actions to avoid 
death. In such a scenario, there is no reason to expect a tight relationship between effective 
information and importance, and the metabolic discrepancy is likely to be quite large. Unlike 
agents "in the wild" , neuron-like agents in cortex-like environments do not face life-or-death 
situations. Their actions are a means of communication, and only indirectly influence global
outcomes. There is therefore more freedom to constrain the policies of agents since costs 
apply directly and neuromodulatory signals are highly indirect. 

Let us consider the consequences of taking metabolic costs into account strongly:

$$P(a) = \pi^*(a) \quad \text{for all } a \in A, \qquad \text{(MP)}$$

where $P(a) := \lambda \circ c(a)$ and $\pi^*(a) := \sum_{s \in S} \pi^*(a|s) \cdot P(s)$ for optimal policy $\pi^*$. Section 5
presents two learning algorithms whose policies satisfy assumption (MP).

By Theorem 3, effective information quantifies the importance of actions by an optimal
agent, to the extent that the frequency of its actions coincides with the metabolically
determined reference distribution $P(A)$. To compute effective information, it is necessary to
observe how an agent responds to different situations. Neurons in the brain do not have 
access to each other's inputs, so computing effective information is difficult. Fortunately, 
the next theorem shows that neurons do not need to compute the effective information 
generated by other neurons: 

Theorem 4 (High $\beta$ couples ei to metabolic cost and importance).

Suppose $\pi^*$ is an optimal policy for an agent satisfying Assumption (MP). In the deterministic
limit $\beta \to \infty$, for all $a \in A$, we have

$$\underbrace{ei(\pi^*, a)}_{\text{effective information}} = \beta \cdot \underbrace{E[\, \tilde{R}_{\pi^*}(S, a) \mid \pi^*(S|a) \,]}_{\text{expected relative reward}} = \underbrace{-\log \lambda \circ c(a)}_{f(\text{metabolic cost})}.$$

Proof. For a deterministic policy, $ei(\pi, a) = -\log \pi(a)$, which equals $-\log \lambda \circ c(a)$ under
Assumption (MP); see the paragraph on metabolic cost in Section 2. Effective information equals
($\beta$ times) importance by Theorem 3. □
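
The first step of the proof, that a deterministic policy generates effective information $-\log \pi(a)$, can be checked directly. The sketch below is illustrative; the policy and the function name are hypothetical.

```python
import math

def ei_deterministic(actions, prior_s, a):
    """Return (ei(pi, a), pi(a)) for a deterministic policy actions[s] -> action index."""
    # Marginal frequency of action a: total prior mass of the situations mapped to a.
    pi_a = sum(p_s for act, p_s in zip(actions, prior_s) if act == a)
    ei = 0.0
    for act, p_s in zip(actions, prior_s):
        if act == a:
            posterior = p_s / pi_a               # pi(s|a); zero for all other situations
            ei += posterior * math.log(posterior / p_s)
    return ei, pi_a

# Hypothetical example: ten situations with a non-uniform prior, spikes in two of them.
prior_s = [0.05, 0.15, 0.1, 0.1, 0.2, 0.1, 0.05, 0.05, 0.1, 0.1]
actions = [0, 0, 1, 0, 0, 0, 0, 1, 0, 0]         # 1 = spike, 0 = silence
ei_spike, pi_spike = ei_deterministic(actions, prior_s, 1)
ei_silence, pi_silence = ei_deterministic(actions, prior_s, 0)
```

Here `ei_spike` matches `-math.log(pi_spike)` up to rounding, so the rarer an action is made, the more effective information each use of it generates.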

The theorem says that, so long as agents share a common metabolic cost profile, effective 
information and importance are aligned throughout the population. As a corollary we 
obtain that, so long as spikes are infrequent, relative reward is almost entirely concentrated 
on spikes and so silences can be ignored: 

Corollary 5 (Reward after spiking approximates total reward).

Suppose an agent has two actions (silence $a_0$ and spike $a_1$) and satisfies assumption (MP).
Then the relative reward after all actions by the agent is approximately the relative reward
after spiking alone:

$$E[\, \tilde{R}_{\pi^*}(S, A) \,] = \pi(a_1) \cdot E[\, \tilde{R}_{\pi^*}(S, a_1) \,] + O\big(\pi(a_1)^2\big).$$

Proof. Theorem 8 in Appendix A.3 shows that

$$I_\pi(S, A) = \pi(a_1) \cdot ei(\pi, a_1) + O\big(\pi(a_1)^2\big).$$

The corollary follows after observing that $I_{\pi^*}(S, A) = \sum_{a \in A} \pi^*(a) \cdot ei(\pi^*, a)$ and, similarly,
$E[\, \tilde{R}_{\pi^*}(S, A) \,] = \sum_{a \in A} \pi^*(a) \cdot E[\, \tilde{R}_{\pi^*}(S, a) \,]$. □
Theorem 4 is our second main result. It says that when an agent strongly controls the
frequency of its actions via the metabolic prior, assumption (MP), the information generated
by its actions coincides exactly with their importance, which also couples tightly with their
metabolic cost. Moreover, by Corollary 5, agents do not lose much by only attending to
the spikes produced by other agents and ignoring the silences. This suggests agents in a 
population sharing a common metabolic cost profile can use metabolic cost to distinguish 
between more and less important actions by their neighbors. 

4 Empirical reward estimates 

We have referred to a policy's expected reward throughout this paper. However, the expected 
distribution of rewards is not known; agents can only access empirical estimates based on 
finite samples. In a dynamically changing environment like the cortex, it is important that 
neurons revise their estimates over short time frames - i.e. accurately estimate expected 
reward from small samples. Guaranteeing the accuracy of an agent's estimate of its expected 
reward is critical, since the estimate determines the agent's policy. 

This section uses effective information to bound the difference between expected reward and 
its empirical estimate. In short: the tighter the metabolic constraint - i.e. the smaller the
set of situations for which an agent spikes - the better the quality of its empirical reward
estimate. 

We consider expected and empirical rewards after spiking since we are interested in the 
expected relative reward encoded in spikes. Many models of neuronal learning use correlations 
between spikes or the relative timing of spikes to control synaptic (de)potentiation, so that
it is rewards after (or just before) output spikes that determine a neuron's policy [16].
Restricting attention to rewards after spiking also provides a mechanism that forces neurons
to specialize by focusing on a narrow band in their input stream.

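The expected and empirical rewards per spike are

$$R_\pi := E[\, R(S, a_1, N) \,] = \sum_{s, n} \pi(s|a_1) \cdot P(n|s) \cdot R(s, a_1, n) \quad \text{and}$$

$$\hat{R}_\pi := \frac{1}{T_1} \sum_{\{(s_t, n_t) \,|\, \pi(s_t) = 1\}} R(s_t, a_1, n_t) \quad \text{respectively,}$$

where $T_1$ counts spikes produced by the agent during $[1, T]$.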

Let us introduce notation for computations with respect to the uniform prior $U(s) := \frac{1}{|S|}$,
where $|S|$ is the total number of possible situations. Let $\pi_u(a) := \sum_{s \in S} \pi(a|s) \cdot U(s)$,
$\pi_u(s|a) := \pi(a|s) \frac{U(s)}{\pi_u(a)}$, and $ei_u(\pi, a) := \sum_{s \in S} \pi_u(s|a) \log \frac{\pi_u(s|a)}{U(s)}$. Note that $ei_u$ recovers the
original definition of effective information in [13], which Definition 1 generalizes.
The following theorem is proved using a version of Occam's razor due to Seldin and Tishby
[17]. Without loss of generality we assume that rewards lie in $[0, b]$. Since only relative
rewards affect the choice of optimal policy, it follows that negative rewards can be stripped
out of the optimization problem by introducing an additive constant.

Theorem 6 (Error bound for empirical reward).

Suppose the agent chooses a deterministic policy $\pi$ under the constraint that it spikes for
a fixed fraction of situations: $\pi_u(a_1) = \text{const}$. Further, suppose that situations are sampled
i.i.d. Then with probability at least $1 - \delta$,

$$|R_\pi - \hat{R}_\pi| \le b \sqrt{\frac{|S|}{2 T_1 \cdot e^{ei_u(\pi, a_1)}} + \frac{\ldots}{2 T_1}}.$$

Proof. See appendix. □

Figure 1: Rewards after spiking. (a) Normalized error: the normalized difference between
expected and empirical reward, plotted against effective information. (b) Brute-force
optimization: the average and variance of the expected reward after spiking and not spiking
for a policy constructed via brute-force optimization.

The theorem implies that policies with high $ei_u$ have better guarantees on their reward
estimates than those with low $ei_u$, since $e^{-x}$ decreases as $x$ increases.

For an agent to implement the metabolic prior, it must constrain the fraction of situations
in which it spikes. Although $ei(\pi, a_1) \neq ei_u(\pi, a_1)$, it is nevertheless the case that $ei$ and
$ei_u$ covary since increasing the number of situations where an agent spikes decreases both $ei$
and $ei_u$; similarly, decreasing the number of situations where an agent spikes increases both
$ei$ and $ei_u$.

Figure 1a plots the effective information generated by spiking against the difference between
the normalized empirical and expected reward. Reward functions of the form
$R(a, s, n) = |a| \cdot g(s, n)$ were drawn randomly, and the error between expected and empirical
reward was computed for 16,000 deterministic policies with $|S| = 50$, $P(S)$ uniform and $T_1 = 20$.
Policies were sampled randomly with $k$, the number of situations causing the policy to
spike, varying uniformly across $[1, 25]$. The figure shows that both normalized error and the
standard error of the error decrease as $ei$ increases.

Figure 1a thus confirms that the bound in Theorem 6 is a reasonable guide to performance
in practice.
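
An experiment of this kind can be sketched in a few lines. The code below is a rough reconstruction under stated assumptions, not the authors' simulation: the neuromodulator is taken to be binary with a randomly drawn $P(n=1|s)$, $g(s,n)$ is drawn uniformly from $[0,1]$, and the raw absolute error is reported because the normalization used in the figure is not specified in the text. All names and default values other than $|S| = 50$, $T_1 = 20$ and $k \in [1, 25]$ are hypothetical.

```python
import math
import random

def simulate_errors(n_policies=2000, n_situations=50, n_spikes=20, k_max=25, seed=0):
    """Return (ei_u, |expected - empirical| reward per spike) for random deterministic policies."""
    rng = random.Random(seed)
    results = []
    for _policy in range(n_policies):
        # Assumed reward model: P(n=1|s) = p[s]; g[s][n] is the reward for spiking in s given n.
        p = [rng.random() for _ in range(n_situations)]
        g = [(rng.random(), rng.random()) for _ in range(n_situations)]

        # Deterministic policy: spike in k randomly chosen situations, stay silent elsewhere.
        k = rng.randint(1, k_max)
        spike_set = rng.sample(range(n_situations), k)

        # Expected reward per spike; with uniform P(S), pi(s|a_1) = 1/k on the spike set.
        expected = sum((1 - p[s]) * g[s][0] + p[s] * g[s][1] for s in spike_set) / k

        # Empirical reward per spike from n_spikes i.i.d. samples.
        total = 0.0
        for _spike in range(n_spikes):
            s = rng.choice(spike_set)
            n = 1 if rng.random() < p[s] else 0
            total += g[s][n]
        empirical = total / n_spikes

        # For a deterministic policy and uniform prior, ei_u = -log(k / |S|).
        ei_u = math.log(n_situations / k)
        results.append((ei_u, abs(expected - empirical)))
    return results

results = simulate_errors(n_policies=200)
```

Plotting the second entry of each pair against the first gives a scatter comparable in spirit to Figure 1a; how closely it matches depends on the assumed reward model.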

Remark 2. The assumption that inputs are i.i.d. is not realistic for a system of interacting 
agents such as cortex. We make two remarks. First, if rewards are only non-zero in the 
presence of neuromodulatory signals, then the assumption is more reasonable since it states 
that situations directly preceding neuromodulator release are i.i.d. Second, similar results
have recently been obtained for non-i.i.d. scenarios using PAC-Bayes methods [18], and
these may be applicable to our setting.

Theorem 6 and the experiments in Fig. 1a are our third main result. They show that
the higher the effective information generated by spiking, the better an agent's empirical
estimate of its expected reward after spiking. The metabolic prior, Assumption (MP), thus
controls the quality of an agent's empirical estimates of its expected reward.^3

^3 A related result in [19] is that the effective information generated by empirical risk
minimization coincides (essentially) with empirical VC-entropy, which appears in bounds on
expected risk.

5 Implications for cooperative learning

The previous sections showed that metabolically constrained agents spike in the few situations
that, on the basis of their historical experience, most frequently lead to high rewards.
This suggests cooperating agents should treat each other's spikes as intrinsically valuable. 

Example 1 (Minimal cooperative reward function).
Consider reward function

$$R(s, a, n) = |s| \cdot |a| \cdot n, \qquad (8)$$

where $|s|$ is the total number of spikes in the input, $|a|$ is the number of spikes in the output,
and $n$ is the neuromodulatory signal.

The motivation for (8) is as follows.

1. Neuromodulatory signal n controls the sign of the reward. Positive signals are 
reinforced and negative avoided. 

2. Agents reinforce or punish their decisions only if they actively contribute to the 
global outcome. Active contributions are distinguished by their higher metabolic 
cost. Term $|a|$, which is either 0 or 1, therefore gates the reward function.

3. An individual agent spiking in isolation cannot impact global outcomes. A simple 
estimate of the impact of a population of agents is the number that spike. The 
amplitude |s| of the reward therefore reflects local spiking activity: the more spikes 
there are, the more likely it is that the agent belongs to a population that impacted 
global behavior. Term \s\ therefore amplifies local rewards according to their likely 
global impact. 

Note that learning under reward function (8) is a population-level phenomenon due to the
$|s|$ term. Thus, although individual agents are unable to affect their environment, reward
function (8) encourages them to behave as if they are responsible for global outcomes, by
taking responsibility for neuromodulatory signals proportionally to the size of the local
spiking population they belong to.
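
The reward function of Example 1 is simple enough to state as code. The sketch below is illustrative only; it assumes inputs are encoded as a list of 0/1 values (one per presynaptic agent), the output as 0 or 1, and the neuromodulatory signal as a signed number.

```python
def cooperative_reward(inputs, output, n):
    """R(s, a, n) = |s| * |a| * n for binary inputs, binary output and signed signal n."""
    local_spikes = sum(inputs)      # |s|: number of spikes in the input
    return local_spikes * output * n

# Three presynaptic spikes, the agent spikes, and the neuromodulator signals a bad outcome.
punished = cooperative_reward([1, 0, 1, 1], output=1, n=-1)
# The same situation with a silent agent: the reward is gated to zero by |a|.
ignored = cooperative_reward([1, 0, 1, 1], output=0, n=-1)
```

The three factors correspond to the three motivations listed above: `n` sets the sign, `output` gates the reward, and `local_spikes` scales it by local activity.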



An impractical - but nonetheless instructive - learning algorithm that implements reward 
function (8) is the following.

Example 2 (Brute force optimization).

Given $M \subset \mathbb{R}$, let $M_x := \{\text{the } x \cdot |M| \text{ elements with highest magnitude}\}$.
Algorithm:

1. Sample reward estimates $M = \{R(s, a_1) \mid s \in S\}$.

2. Rank elements of $M$ and compute $M_{P(a_1)}$.

3. The optimal policy is

$$\pi_{BF}(a_1|s) = \begin{cases} 1 & s \in M_{P(a_1)} \\ 0 & \text{else.} \end{cases}$$



Brute-force optimization can be visualized as moving a window of fixed size and variable
shape over the input space, searching for the configuration that maximizes reward. It
implements the optimization

$$\arg\max_\pi \sum_{s, a} P(s) \cdot \pi(a|s) \cdot R(s, a) \quad \text{subject to:} \quad \sum_s \pi(a_1|s) \cdot P(s) = P(a_1).$$

Reward function (8) ensures the agent only has to search over rewards after spiking, i.e. those
of the form $R(s, a_1)$, $s \in S$.
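
The three steps of Example 2 can be sketched as follows. This is a minimal illustration under two assumptions that are not in the text: situations are equiprobable, so the constraint reduces to spiking in a fixed number of situations, and rewards are non-negative, so ranking by magnitude and ranking by value agree. Function names and the example rewards are hypothetical.

```python
def brute_force_policy(spike_rewards, spike_fraction):
    """Spike (1) in the top spike_fraction of situations ranked by R(s, a_1), else stay silent (0)."""
    n = len(spike_rewards)
    k = round(spike_fraction * n)                      # size of the spiking set M_{P(a_1)}
    ranked = sorted(range(n), key=lambda s: spike_rewards[s], reverse=True)
    chosen = set(ranked[:k])
    return [1 if s in chosen else 0 for s in range(n)]

def mean_reward(spike_rewards, policy, action):
    """Average of R(s, a_1) over the situations in which the policy takes the given action."""
    values = [r for r, a in zip(spike_rewards, policy) if a == action]
    return sum(values) / len(values)

# Hypothetical reward estimates R(s, a_1) for ten equiprobable situations, with P(a_1) = 0.3.
rewards = [0.2, 0.9, 0.1, 0.7, 0.4, 0.3, 0.8, 0.5, 0.6, 0.05]
policy = brute_force_policy(rewards, spike_fraction=0.3)
after_spike = mean_reward(rewards, policy, 1)
elsewhere = mean_reward(rewards, policy, 0)
```

By construction the selected situations have a higher average reward than the rest, and shrinking `spike_fraction` can only raise the average over the selected set.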

An important question, motivated by the result following from assumption (MP), is how a
strong metabolic constraint affects the encoding of rewards into spikes. The expected reward
(not importance) after spiking is

$$R(a_1) = E[\, R(S, a_1, N) \mid p(N|S) \cdot p(S|a_1) \,].$$



The expected reward compresses a large amount of data - specifically, the structure of
distributions $p(n|s)$ and $p(s|a_1)$ - into a single number. Figure 1b shows an example, where
situations are ranked according to expected reward and the policy spikes for the 15% of
situations with highest reward. By construction the expected reward after spiking is much
higher than after silence, so spikes encode reward. Tightening the metabolic constraint, so
that the policy spikes for $\ll 15\%$ of situations, increases the reward after spiking. Furthermore,
the variance in reward after spiking is much lower than after not spiking and, once
again, tightening the metabolic constraint will tend to further decrease the variance after
spiking.

Thus, under tight metabolic constraints, the spikes produced by brute-force optimized poli- 
cies are robust signals of high expected reward. 

More generally, consider two scenarios: 

• If the rewards are distributed evenly over $M_{P(a_1)}$, i.e. $R(s, a_1) \approx R(s', a_1)$ for all
$s, s' \in M_{P(a_1)}$, then spikes are robust predictors of rewards.

• If high rewards concentrate on a small subset $H$ of $M_{P(a_1)}$, i.e. $R(s', a_1) \ll R(s, a_1)$
for $s \in H$ and $s' \notin H$, then rewards after spiking are more variable.

The first scenario is preferable when agents use spikes as local estimates of expected reward: 
variance should be as low as possible. A simple way to ensure evenly distributed rewards 
after spiking is to keep $M_{P(a_1)}$ small by forcing low $\pi(a_1)$, which ties in with the results on
empirical estimates in §4.



Brute-force optimization separates exploration and exploitation into two distinct steps. A 
"softer" algorithm that becomes more certain as it acquires more data is a modified version 
of Q-learning [20]:

Example 3 (Metabolically constrained Q-learning).

If the agent chooses action $a$ in situation $s$ and subsequently receives neuromodulator $n$,
then it updates the Q-matrix by

$$Q(s, a) \leftarrow Q(s, a) + \alpha \cdot [\, R(s, a, n) - Q(s, a) \,],$$

where $\alpha$ controls the rate. After updating $Q$, the agent constructs a new policy from $Q$
using operation $\mathcal{N}^n(\bullet)$.

Each application of $\mathcal{N}$ renormalizes the policy twice: first by $Z(a)$ chosen such that
$\sum_{s \in S} \pi(a|s) P(s) = P(a)$ for all $a$, and then by $Z(s)$ chosen such that $\sum_{a \in A} \pi(a|s) = 1$
for all $s$. Setting $n = 3$ yields a policy that approximately implements the metabolic constraint.
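
The update and the double renormalization can be sketched as follows. This is a hedged illustration, not the authors' implementation: the way the unnormalized policy is built from $Q$ is not recoverable from the text, so the sketch assumes a softmax-style weight $e^{\beta Q(s,a)}$ with a hypothetical inverse temperature `beta`; the renormalization step follows the description above.

```python
import math

def q_update(q, s, a, reward, rate):
    """Q(s, a) <- Q(s, a) + rate * (reward - Q(s, a)), in place."""
    q[s][a] += rate * (reward - q[s][a])

def renormalize(policy, prior_s, prior_a, n_iter=3):
    """Alternately match action frequencies to P(a), then make each row sum to one."""
    pi = [row[:] for row in policy]
    for _ in range(n_iter):
        # First renormalization: scale column a so that sum_s pi(a|s) P(s) = P(a).
        for a, p_a in enumerate(prior_a):
            marginal = sum(pi[s][a] * prior_s[s] for s in range(len(pi)))
            for s in range(len(pi)):
                pi[s][a] *= p_a / marginal
        # Second renormalization: scale row s so that sum_a pi(a|s) = 1.
        for row in pi:
            z = sum(row)
            for a in range(len(row)):
                row[a] /= z
    return pi

def policy_from_q(q, prior_s, prior_a, beta=2.0, n_iter=3):
    """ASSUMPTION: start from exp(beta * Q); the source's exact construction is not known."""
    raw = [[math.exp(beta * value) for value in row] for row in q]
    return renormalize(raw, prior_s, prior_a, n_iter)

# Hypothetical example: four equiprobable situations, actions (silence, spike), P(a_1) = 0.2.
prior_s = [0.25, 0.25, 0.25, 0.25]
prior_a = [0.8, 0.2]
q = [[0.0, 1.0], [0.0, 0.2], [0.0, -0.5], [0.0, 0.1]]
pi = policy_from_q(q, prior_s, prior_a)
spike_frequency = sum(pi[s][1] * prior_s[s] for s in range(4))
```

After three rounds the spike frequency is close to, but not exactly, the target $P(a_1)$, which is consistent with the statement that the constraint is implemented only approximately.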

Figure 2 considers how effective information and expected reward covary as agents Q-learn.
We initialized 5000 Q-learning agents with $\pi(a, s) = P(s) \cdot P(a)$ for $P(a_1) \in \{0.1, 0.3, 0.5\}$.
As the agents adapt, their policies tend to become both more deterministic and more
likely to spike in situations yielding higher rewards, so as agents adapt they generate both
more effective information and more rewards after spiking. The tighter the metabolic
constraint (i.e. the lower $P(a_1)$), the higher the expected reward after spiking. Thus, effective
information provides a reliable guide to expected reward, even in the presence of a strong
metabolic constraint.

Remark 3. Under metabolic Q-learning, term $|s|$ in reward function (8) uses the degree of
local consensus to control the speed with which an agent learns.

Figure 2: Metabolically constrained Q-learning. (a) Effective information from spikes.
(b) Expected reward after spiking. As agents learn, effective information
and expected reward increase in qualitatively the same way. Tighter metabolic constraints
yield both higher effective information and greater expected rewards.



An important point is that neither brute-force optimization nor Q-learning is a practical
learning rule; both simply construct lookup tables. In contrast, cortical neurons do not
control the probability of spiking for individual inputs, but rather for individual synapses.
When a neuron potentiates a synapse, its response to many inputs is altered. Neurons thus 
generalize in a manner that depends on the structure of their dendritic tree. 

This section presented two simple implementations of metabolically constrained learning
that privilege spikes. Future work will investigate how populations of such learners behave.



6 Discussion 



The space of possible policies that the cortex as a whole could implement is vast: it consists 
of millions or billions of agents each receiving inputs from thousands or tens of thousands of 
other agents. Cutting down the size of the search space is crucial. This paper has shown that 
maximizing reward under strong metabolic constraints leads to a tight coupling between the 
metabolic cost, information content and importance of actions. This suggests that it may 
be useful to focus the attention of agents by privileging spikes during learning. 

In summary:

1. Agents attempt to maximize their expected reward, constrained by the metabolic
   cost of their actions.
   [Definition 2, Assumption MP]

2. If an agent behaves optimally then the information generated by its actions aligns
   with both their importance and their metabolic cost.
   [ Theorem (a] Theorem [I]]

3. The less frequently an agent uses expensive actions, the better its empirical reward
   aligns with importance.
   [Theorem 6, Figure 1]

4. The total contribution of metabolically inexpensive actions is negligible.
   [Corollary 5]

5. If agents share a common metabolic cost profile, then it is obvious which actions
   predict high reward.
   [Spikes, with high confidence]

6. Therefore, agents in a cooperating population should treat costly inputs and outputs
   (spikes) as intrinsically valuable.
   [e.g. reward function [S], Figure 2]

An intriguing analogy is that spikes form a cortical currency which neurons use to
communicate estimates of future rewards. If neurons spike too frequently, then the reward encoded
in spikes drops, analogously to how inflation degrades the worth of a currency. Money
provides a measure of importance that agents in a market economy use to guide their behavior.
If all goes well, agents will consistently contribute to the global good by performing strictly
local optimizations.

Spikes, and more generally metabolically expensive actions, thus provide local guidelines
that a population of cooperating agents can exploit during learning.

Rewarding spikes introduces a tendency towards runaway spiking (e.g. epileptic seizures)
which needs to be controlled. Lateral inhibition, which we have not discussed in this paper,
is a mechanism to impose sparse activity in the brain [22, 23] that may play a useful role in
other cooperating populations.

Imposing the metabolic constraint requires active effort. In the case of biological neurons
it has been proposed that controlling synaptic efficacies, and so the tendency of neurons to
spike, is a function of slow-wave sleep [24]. Recent experiments have found evidence in line
with this hypothesis, see [25-27].



The results in this paper depend on specific assumptions and model choices. However, we
believe these can be relaxed without substantially changing the main message. Future work
will investigate the computational role of spike-timing dependent learning in cortex [16, 28]
from this perspective, to better understand how neurons encode information and rewards
into spikes.
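
A1 Proof of Theorem 1

Theorem 1.
The optimal policy satisfies

$$\pi(a|s) = P(a) \cdot \frac{e^{\beta \cdot R(s,a)}}{Z(s,\beta)},$$

where $Z(s,\beta) = \sum_{a \in A} P(a) \cdot e^{\beta \cdot R(s,a)}$ is the normalization constant.

Proof. The proof is adapted from the information bottleneck method in [?]. Let $\lambda(s)$ be
Lagrange multipliers, and consider

$$F = \sum_{s,a} P(s)\,\pi(a|s) \log\frac{\pi(a|s)}{P(a)} - \beta \sum_{a,s,r} P(s)\,\pi(a|s)\,P(r|a,s)\,R(s,a,r) - \beta \sum_{s,a} \lambda(s)\,\pi(a|s).$$

The derivative with respect to $\pi(a|s)$ is

$$\frac{\delta F}{\delta \pi(a|s)} = P(s)\left[1 + \log\frac{\pi(a|s)}{P(a)}\right] - \beta \sum_{r} P(s)\,P(r|a,s)\,R(s,a,r) - \beta\lambda(s).$$

We have that $\frac{\delta P(s)}{\delta \pi(a|s)} = 0$ because $P(s)$ does not depend on the choice of
policy by the assumption in Eq. (2).

Rewrite the derivative as

$$\frac{\delta F}{\delta \pi(a|s)} = P(s)\left[\log\frac{\pi(a|s)}{P(a)} - \beta \sum_{r} P(r|a,s)\,R(s,a,r) - \beta\,\frac{\tilde{\lambda}(s)}{P(s)}\right],$$

where $\tilde{\lambda}(s) = \lambda(s) - P(s)/\beta$ absorbs the constant term, and solve for zero to obtain

$$\pi(a|s) = P(a) \cdot \frac{e^{\beta \cdot R(s,a)}}{Z(s,\beta)}$$

as desired. Here $R(s,a) = \sum_r P(r|a,s)\,R(s,a,r)$ is the expected reward, and normalization of
$\pi(\cdot|s)$ fixes the multiplier through $e^{\beta\tilde{\lambda}(s)/P(s)} = 1/Z(s,\beta)$.

□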
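
The closed form in Theorem 1 can be checked numerically on a small example. The sketch below is an
illustration under stated assumptions rather than part of the proof. The reward table, $\beta$ and
the distributions are hypothetical. As in the proof, $P(a)$ is treated as a fixed reference
distribution. The objective is $F$ without the Lagrange term, since every candidate policy is
already normalized.

```python
import math
import random


def optimal_policy(p_a, reward, beta):
    """pi(a|s) = P(a) * exp(beta * R(s,a)) / Z(s, beta) for each situation s."""
    policy = []
    for r_s in reward:
        weights = [p * math.exp(beta * r) for p, r in zip(p_a, r_s)]
        z = sum(weights)                      # Z(s, beta)
        policy.append([w / z for w in weights])
    return policy


def objective(policy, p_s, p_a, reward, beta):
    """F = sum_s P(s) sum_a pi(a|s) * (log(pi(a|s) / P(a)) - beta * R(s,a))."""
    total = 0.0
    for s, row in enumerate(policy):
        for a, pi in enumerate(row):
            if pi > 0:
                total += p_s[s] * pi * (math.log(pi / p_a[a]) - beta * reward[s][a])
    return total


p_s = [0.5, 0.3, 0.2]
p_a = [0.9, 0.1]                                   # [silence, spike]
reward = [[0.0, 1.0], [0.0, -0.5], [0.0, 0.2]]     # hypothetical expected rewards R(s, a)
beta = 2.0

pi_star = optimal_policy(p_a, reward, beta)
f_star = objective(pi_star, p_s, p_a, reward, beta)

# Compare against randomly drawn normalized policies.
rng = random.Random(0)
f_others = []
for _ in range(200):
    other = []
    for _ in p_s:
        x = rng.random()
        other.append([1.0 - x, x])
    f_others.append(objective(other, p_s, p_a, reward, beta))
```

No random policy attains a lower value of $F$ than `pi_star`, and `f_star` equals
$-\sum_s P(s) \log Z(s,\beta)$, as expected for a Gibbs-form minimizer.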



A2 Proof of Theorem 6

Occam's razor can be paraphrased to say that the simplest hypothesis should be preferred.
Suppose we have a set of hypotheses $\mathcal{H}$, with prior distribution $P(h)$ on $\mathcal{H}$. Let $-\log P(h)$
denote the complexity of hypothesis $h$. Let $L : X \times \mathcal{H} \to [0, b]$ be a loss function. Then

Theorem 7 (Occam's razor).

For any data generating distribution on $X$ and any prior distribution $P(h)$ over $\mathcal{H}$, with
probability greater than $1 - \delta$ over drawing an i.i.d. sample from $X$ of size $T$, for all $h \in \mathcal{H}$:

$$L(h) - \hat{L}(h) \le b \cdot \sqrt{\frac{-\log P(h) + \log\frac{1}{\delta}}{2T}}.$$

Proof. See [17].

□



Let $\mathcal{H} = \{\pi : S \to A\}$ denote the set of deterministic policies, where $A = \{a_0, a_1\} = \{0, 1\}$.
Define loss function

$$L : (S \times \mathcal{N}) \times \mathcal{H} \to \mathbb{R} : (s, n) \times \pi \mapsto R(s, \pi(s), n).$$

Further, set probability distribution $P(s,n) = P(s) \cdot P(n|s)$ on $S \times \mathcal{N}$. Theorem 7 holds
for any sampling distribution. In particular we may use the policy $\pi$ to restrict samples to
situations that cause the agent to spike to obtain $P(s, n \mid \pi(s) = a_1) = \pi(s|a_1) \cdot P(n|s)$. It
follows that

$$L(\pi) = R_\pi = \mathbb{E}\big[ R(s, a_1, n) \,\big|\, \pi(s|a_1) \cdot P(n|s) \big]$$

is the expected reward and

$$\hat{L}(\pi) = \hat{R}_\pi = \frac{1}{T_1} \sum_{\{(s_t, n_t) \,|\, \pi(s_t) = a_1\}} R(s_t, a_1, n_t)$$

is the empirical reward, where $T_1$ is the number of spikes produced by the agent during $[1, T]$.
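Theorem 6.

Let $\mathcal{H} \supset \mathcal{H}_k = \{\pi : S \to A \text{ s.t. } |\pi^{-1}(a_1)| = k\}$ denote policies that spike for exactly $k$
situations. Given the setup above, with probability at least $1 - \delta$,

$$R_\pi - \hat{R}_\pi \le b \cdot \sqrt{ |S| \cdot \frac{ei_u(\pi, a_1) + 1}{2 T_1 \cdot e^{ei_u(\pi, a_1)}} + \frac{\log\frac{1}{\delta}}{2 T_1} }.$$

Proof. Let $N = |S|$ denote the number of possible situations. We put the uniform prior on
$\mathcal{H}_k$, so $P(\pi) = \binom{N}{k}^{-1}$. By Stirling's approximation, $\log\binom{N}{k} \le k \log\left(\frac{N \cdot e}{k}\right)$, and it follows that

$$-\log P(\pi) = \log\binom{N}{k} \le k \log\frac{N \cdot e}{k} = N \cdot \frac{k}{N} \cdot \left( \log\frac{N}{k} + 1 \right).$$

The theorem follows since $\pi_u(a_1) = \frac{k}{N}$ and $ei_u(\pi, a_1) = \log\frac{N}{k}$.

□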
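
To get a feel for how the bound in Theorem 6 scales, it can be evaluated directly. The sketch below
is illustrative only. It assumes natural logarithms throughout, so that $k/N = e^{-ei_u}$, and the
function names and the numbers in the usage lines are hypothetical.

```python
import math


def log_binomial(n, k):
    """Exact complexity -log P(pi) = log C(N, k) under the uniform prior on H_k."""
    return math.log(math.comb(n, k))


def stirling_upper(n, k):
    """Upper bound k * log(N * e / k) used in the proof."""
    return k * math.log(n * math.e / k)


def theorem6_bound(n_situations, k, n_spikes, delta, b=1.0):
    """b * sqrt(|S| * (ei + 1) / (2 T1 e^ei) + log(1/delta) / (2 T1)), with ei = log(N / k)."""
    ei = math.log(n_situations / k)
    complexity = n_situations * (ei + 1.0) / math.exp(ei)
    return b * math.sqrt((complexity + math.log(1.0 / delta)) / (2.0 * n_spikes))


# Hypothetical numbers: 1000 situations, 5000 observed spikes, delta = 0.05.
ks = (10, 100, 500)
bounds = [theorem6_bound(1000, k, 5000, 0.05) for k in ks]
stirling_ok = all(log_binomial(1000, k) <= stirling_upper(1000, k) for k in ks)
```

With the number of spikes $T_1$ held fixed, the bound tightens as $k$ shrinks. Policies that spike
in fewer situations are simpler hypotheses under the uniform prior, so their empirical reward after
spiking is a more reliable estimate.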

A3 Effective and mutual information

For an agent operating under high metabolic constraints, it turns out that effective
information provides a useful approximation to mutual information:

Theorem 8 (effective information approximates mutual information).

Suppose an agent has two actions (silence $a_0$ and spike $a_1$) and produces spikes very
infrequently: $\pi(a_1) \ll 1$. Then the total information transferred by the agent is approximately
the information it transfers using spikes alone:

$$I(S; A) = \pi(a_1) \cdot ei(\pi, a_1) + O\big(\pi(a_1)^2\big).$$

Proof. Observe that

$$\pi(s|a_0) = \frac{P(s) - \pi(s|a_1) \cdot \pi(a_1)}{1 - \pi(a_1)} = P(s) + \pi(a_1)\big(P(s) - \pi(s|a_1)\big) + O\big(\pi(a_1)^2\big).$$

Thus to first order in $\pi(a_1)$, $\pi(s|a_0)$ is of the form $p + \delta p$ where $\int \delta p = 0$. We can then
compute

$$D[p + \delta p \,\|\, p] = \int (p + \delta p) \log_2\left(1 + \frac{\delta p}{p}\right) = \alpha \int (p + \delta p)\left(\frac{\delta p}{p} + O(\delta p^2)\right) = \alpha \int \delta p + O(\delta p^2) = O(\delta p^2),$$

where $\alpha = \frac{1}{\ln 2}$.

Substituting $p = P(s)$ and $\delta p = \pi(a_1)\big(P(s) - \pi(s|a_1)\big)$ gives

$$D[\pi(S|a_0) \,\|\, P(S)] = D[p + \delta p \,\|\, p] + O\big(\pi(a_1)^2\big) = O\big(\pi(a_1)^2\big).$$

Thus $I(S; A) = \pi(a_1) \, D[\pi(S|a_1) \,\|\, P(S)] + O\big(\pi(a_1)^2\big)$.

□
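
The approximation in Theorem 8 can be checked numerically by shrinking the spike rate and watching
the gap between the mutual information and its spike-only part. The sketch below is a hedged
illustration: the prior, the relative spike propensities and the function names are hypothetical,
and information is measured in bits.

```python
import math


def kl_bits(q, p):
    """Kullback-Leibler divergence D[q || p] in bits."""
    return sum(qi * math.log2(qi / pi) for qi, pi in zip(q, p) if qi > 0)


def information_terms(prior, spike_given_s):
    """prior[s] = P(s), spike_given_s[s] = pi(a1|s).

    Returns (pi(a1), I(S;A), pi(a1) * ei(pi, a1)), using the decomposition
    I(S;A) = pi(a1) * D[pi(S|a1) || P(S)] + pi(a0) * D[pi(S|a0) || P(S)].
    """
    p_spike = sum(p * q for p, q in zip(prior, spike_given_s))
    post_spike = [p * q / p_spike for p, q in zip(prior, spike_given_s)]
    post_silent = [p * (1 - q) / (1 - p_spike) for p, q in zip(prior, spike_given_s)]
    spike_part = p_spike * kl_bits(post_spike, prior)
    mutual = spike_part + (1 - p_spike) * kl_bits(post_silent, prior)
    return p_spike, mutual, spike_part


prior = [0.4, 0.3, 0.2, 0.1]
shape = [0.1, 0.9, 0.3, 0.6]          # hypothetical relative spike propensities
results = []
for scale in (0.1, 0.01, 0.001):
    p_spike, mutual, spike_part = information_terms(prior, [scale * x for x in shape])
    results.append((p_spike, mutual - spike_part))
```

Each entry of `results` pairs the spike rate $\pi(a_1)$ with the gap $I(S;A) - \pi(a_1) \cdot ei(\pi, a_1)$.
The ratio of the gap to $\pi(a_1)^2$ stays roughly constant as the spike rate shrinks, consistent
with the $O(\pi(a_1)^2)$ remainder.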